An adaptive Metropolis-Hastings scheme: 
sampling and optimization 

David H. Wolpert 1 * and Chili Fan Lee 2t 

X NASA Ames Research Center 
MailStop 269-1, Moffett Field, CA 94035-1000 

2 Physics Department 
Clarendon Laboratory, Oxford University 
Parks Road, Oxford 0X1 3PU, U.K. 

February 2, 2008 



Abstract 

We propose an adaptive Metropolis-Hastings algorithm in which 
sampled data are used to update the proposal distribution. We use 
the samples found by the algorithm at a particular step to form the 
information-theoretically optimal mean-field approximation to the tar- 
get distribution, and update the proposal distribution to be that ap- 
proximatio. We employ our algorithm to sample the energy distri- 
bution for several spin-glasses and we demonstrate the superiority of 
our algorithm to the conventional MH algorithm in sampling and in 
annealing optimization. 
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Monte Carlo methods are powerful tool for evaluating integrals 
and simulating stochastic systems (for a recent review, see pQ), and 
they have proved very useful for studying thermodynamic properties 
of model systems with many degrees of freedom. The core of any such 
method is an algorithm for producing independently and identically 
distributed (IID) samples of a provided target probability distribution 
ir(x £ X). One of the most popular method used is the Metropolis- 
Hastings (MH) algorithm 2 and many of its variants |3J 03 EI] • The 
Markov transition matrix produced by this algorithm is parameterized 
by a proposal distribution T(x,x'). Typically T is set before the start 
of the Markov chain in a 7r-independent manner and fixed throughout 
the running of that chain. The rate at which the associated Markov 
chain converges to the desired IID sample is crucially dependent on 
the relation between T and ir however. Since the set {x(t)} produced 
by the MH algorithm is (eventually) an IID sample of ir, one can use 
{x(t)} to produce an empirical estimate of ir |7|. This suggests that 
we empirically update T along the Markov chain to be an increasingly 
accurate estimate of ir. An adaptive version of the MH algorithm was 
introduced in ^Hj. Here we investigate that algorithm and demon- 
strate its superiority through spin glass experiments. A full version of 
this paper can be found at http://tc.arc.nasa.gov/dhw. 



1 Sampling 

For many systems of interest in physics, X is high-dimensional (e.g., 
the number of spins in an Ising system), and for the density estima- 
tion of ir to work well it must be restricted to producing estimates 
from a relatively low-dimensional space, Q. Intuitively, the idea is 
to try to find the q € Q that is "closest" to ir and use that to up- 
date T, presuming that this will produce the most quickly converging 
Markov chain. We generically call such algorithms Adaptive Metropo- 
lis Hastings (AMH). To specify an AMH algorithm one must fix the 
measure of closeness, the choice of Q, and the precise details of the 
resultant density estimation algorithm. One must then specify how 
the estimates of ir{x) are used to update T(x,x'). 

The most popular way to measure closeness between probability 
distributions is with the (asymmetric) Kullback-Leibler (KL) distance 

m 

D(p\\ P ')= -X>(*)log^M (1) 

Recent work in Probability Collectives (Hj provides insight into how 
to do density estimation to minimize KL distance when Q is low- 
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dimensional. In particular, say Q is the set of all product distributions 
over X, q(x) = J[ i qi(xi), which is equivalent to the familiar mean-field 
approximation jllj . Then D{q\\Tr) is minimized if 

(hip-) oc e ^d°g(-)l^) , Vi (2) 

which E(.\.) is the expected value of the first entry conditional on the 
fixed second entry ^01- Accordingly, D(q\\ir) is just the associated 
free energy of q if tt is a Boltzmann distribution. 

Here we instead consider D(Tr\\q), which it can be argued is more 
appropriate, in light of the information-theoretic justification of KL 
distance. The product distribution minimizing this distance can be 
written down directly: it has the same marginals as tt, i.e., the op- 
timal q obeys qi = 7Tj, \/i 10 . Now a Markov chain produced by 
a run of the conventional MH algorithm converges to an IID sample 
of tt. So the i'th component of the elements of that chain, 
become an IID sample of TTi. Accordingly, those sample-components 
give us an IID sample of the distribution minimizing D{Tr\\q). So if 
the number of possible Xi values is not too large, we can use simple 
histogramming of the elements of the Markov chain produced by the 
MH algorithm to form our estimate of each marginal 7Tj, and therefore 
of the q minimizing D(q\\ir). This gives us a rule for updating the 
proposal distribution, P(T* | {x(t)}). 

There are a number of subtleties one should account for in choosing 
the precise details of P(T l \ {x(t)}). In practice there is almost always 
substantial discrepancy between tt and q, since Q is a small subset of 
the set of all possible tt. This means that setting T(x, y) = q(y) 
typically results in frequent rejections of the sample points. The usual 
way around this problem in conventional MH (where T is fixed before 
the Markov process starts, and therefore is typically an even worse fit 
to tt) is to use a (i-independent) T{x, y) that forces x and y to be close 
to one another. Intuitively, doing this means that once the Markov 
chain finds an x with high tt(x), the y's proposed by T(x,y) will also 
have reasonable high probability (assuming tt is not too jagged). We 
integrate this approach into our AMH algorithm by setting T*(x, y) 
to be q t (y) "masked" and renormalized to force y to be close to x. 

Another important issue is that the earlier a point is on the Markov 
chain, the worse it serves as a sample of tt. To account for this, one 
should not form qi{xi = s) at time n simply as the fraction of all 
points t < n for which Xi(t) = s. Instead we form those estimates 
by geometrically aging the importance of the points before evaluating 
the fraction. This means that more recent points have more of an 
effect on our estimate of tt. This aging has the additional advantage 
that it makes the evolution of the proposal distribution a relatively 
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low-dimensional Markov process, which intuitively should help speed 
convergence. 

In [5J H)] related ideas of how to exploit online-approximations of it 
that are generated from the random walk were explored. None of that 
work explicitly considers information-theoretic measures of distance 
(like KL distance) from the approximation to it. Nor is there any 
concern to "mask" the estimate of tt in that work. The algorithms 
considered in that work also make no attempt to account for the fact 
that the early x(t) should be discounted relative to the later ones. In 
addition, not using product distributions, parallelization would not be 
as straightforward with these alternatives schemes. 



Our AMH algorithm 

Our proposed algorithm consists of three successive phases: the first of 
these is the cooling phase and the third is the data collecting phase. In 
both of those phases, the conventional Metropolis-Hastings algorithm 
is used, i.e., there is no updating on the proposal distribution. The 
second phase is where the proposal distribution is adaptively updated. 
The details are presented below: 

Let N be the number of components of x and q t the estimate of tt 
at the i'th step of the walk. We consider the following algorithm: 

1. Set T*(x,y) to q t (y) masked so that y and x differ in only one 
component: 



/ N \ N 

T\x, y) cx 5 £ S(xi - Vi ) - N + 1 J] q\{ yi 

\t=l / k=l 



(3) 



2. As in conventional MH, sample [0, 1] uniformly to produce a r 
and set 

x(t + l) = l y > *r <!*(*(*), y) (4) 
[ x(t), otherwise 



where 



pt , x • L Ay^jy^x) \ 

R (x,y) = mm 1 , r ■ (5) 

7r(x)T*(x,y) I 



3. Only in phase 2: 

Periodically update q. If modAr(t + 1) = 0, then update the set 
{q\} by the non-negative multiplier a < 1: 
For all i,x'i, if x\ = Xi(t) 



c 

otherwise 



q t + 1 {x> l )=a{q\(x[)-l) + l (6) 



q t +\x> l )=aq\{x' i ) (7) 
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If modjv(t + 1) 7^ 0, then qj + (x^) = <?*( X D- To avoid freezing 
the proposal distribution, qi is not allowed to get too close to 
the boundary of the probability simplex (i.e., less than 0.2 x the 
initial uniform distribution). 

4. t <— t + 1. Repeat from step 1. 

We note again that in the the first phase of the algorithm, T is uniform 
and step 3 is not implemented, in the second phase, all steps above are 
implemented and in the third phase, step 3 is not implemented. We 
also note that our updating method depend only on the current state 
and hence is much more efficient than previously proposed methods 



Sampling Experiments 

Currently there is no consensus on how to quantify "how close" a set 
{x(t)} is to an IID sample of tt. One approach is to input the set into a 
density estimation algorithm 0. One can then use KL distance from 
that estimated distribution to tt as the desired quantification. This can 
be problematic in high-dimensional spaces though, where the choice of 
density estimation algorithm would be crucial. However say we have a 
contractive mapping F : x E X — > y EY where Y is a low-dimensional 
space that captures those aspects of X that are of most interest. We 
can apply F to the {x(t)} to produce its image in Y, {y(t)}. Next 
one can apply something as simple and (relatively) unobjectionable 
as histogramming to do the density estimation translating {y(t)} to 
an associated estimate of the generating distribution over Y. We 
can then use KL distance between that histogram and F(tt) as the 
desired quantification of how good our transition matrix is. This is 
the approach we took here. 

An important issue with KL distance is that the KL distance from 
(a density of) the sample points to tt diverges if no samples are ob- 
tained in region where tt is substantially non-zero. This is not a seri- 
ous problem if the total probability, e, of such regions KL divergences 
is negligible because the discrepany obtained on any expected value 
calculations will be bound by e. Figure 2 illustrates how AMH and 
conventional MH compare in their values of e. 

Our first experiment concerns the Ising spin-glass model: 



where < i,j > denotes summation over all neighbours. In this function 
the Jij and hi are randomly generated integers in the interval [— 5 , 5] 



0. 





<ij> 
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and the x% can take on values —1 and 1. Our task is to sample the 
associated Boltzmann distribution: 

tt(x) oc exp(-F(x)/T) (9) 

where T corresponds to the temperature in a thermodynamic setting. 
We have chosen spin-glasses for illustration because it is generally be- 
lieved that they display salient features of complex disordered systems 

E2. 

We have performed experiments on spin-glasses in a ID ring for- 
mation (with 50, 75 and 100 spins shown in Figure 1). In these ex- 
periments, we firstly run, with random initial states, 5 long Markov 
chains (800,000 xJV steps where N is the number of spins and data are 
collected at the last quarter of chain) with the conventional MH algo- 
rithm. We then average the energy distributions obtained to form our 
target distribution. Its closeness to the true distribution is suggested 
by how small the associated KL distances to the original distributions 
are (the bottom three lines in Figure 1). 

We then produced 100 samples of energy distributions with the MH 
and the adaptive MH methods, with chains of 40,000 xN steps each. 
We note that in the adaptive MH method, qi(xi) corresponds to the 
the probability of spin i being in state X{. Data are again collected in 
the last quarter of each chain in both case, and we performed proposal 
distribution updates as detailed before in the third quarter of the 
chains in the adaptive M-H case (with the updating parameter a = 
0.98). 

Figures 1 and 2 show the results of these experiments with the 
error bars being the errors on the means. We see that AMH (with o 
markers) outperforms conventional MH (with * markers) in sampling, 
as well as in avoiding KL divergence. Similar experiments on a 2D 
lattice have also been performed and the adaptive M-H shows similar 
superior performance over conventional MH. 

Optimization Experiments 

We now investigate the performace of using our algorithm for opti- 
mization rather than sampling, i.e., for minimizing energy. We con- 
sider the same problem as before, with 100 spins. In the simulation we 
randomly generate 20 different sets of {J, h} for the Hamiltonian in 
eq. |H1 The temperature goes from 1 to 0.05 in 19 equal steps. We pro- 
duce 50 samples each for the MH and AMH versions of the algorithm; 
the results are presented in Fig. 3. 
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2 Conclusion 



We have proposed a new adaptive Metropolis-Hastings which, with the 
product distribution assumption, is easy to implement in sampling and 
we have shown its superiority over conventional Metropolis-Hastings 
with computer experiments. Compared with adaptive Metropolis- 
Hastings proposals [sHHU, we have demonstrated the usefulness of our 
proposed algorithm with highly non-trivial examples, i.e., spin-glasses, 
which highlights the usefulness of our proposed algorithm for sampling 
complex distribution. With annealing in temperature, our method is 
also shown to be useful in hard optimization. 

Acknowledgements: We thank Bill Macready for stimulating 
discussion. C.F.L. thanks NASA, University College (Oxford) and 
NSERC (Canada) for finanical support. 
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Figure 1: KL distances between the random walk points and the target 
density .(* = MH, o = AMH, + = MH with long chains. The error bars 
are errors on the means.) The inset plot shows the percentages of deviation 
with respect to the free energies found with the long chains in the 100-spin 
system. (Error bars are smaller than the markers.) 
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gure 2: (* = MH, o = AMH. The error bars are errors on the means.) 
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1.8 2 2.2 2.4 2.6 2.8 

Temperature T 



10 



Figure 3: Results of 20 different spin-glass experiments. Constants are added 
to the y-axis so that the minimum energies found by AMH are zero. (* = 
MH, o = AMH. The error bars are errors on the means.) 
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